Abstract


This paper introduces the mixture general diagnostic model (MGDM), an extension of the 
general diagnostic model (GDM). The MGDM extension allows one to estimate diagnostic 
models for multiple known populations as well as discrete unknown, or not directly observed 
mixtures of populations. The GDM is based on developments that integrate located latent 
class models; multiple classification latent class models; and discrete, multidimensional item 
response models into a common framework. Models of this type express the probability of a 
response vector as a function of parameters that describe the individual item response 
variables in terms of required skills and of indirectly observed (latent) skill profiles of 
respondents. The skills required for solving the items are, as in most diagnostic models, 
represented as a design matrix that is often referred to as a Q-matrix. This Q-matrix consists 
of rows describing, for each item response, what combination of skills is needed to succeed 
or to obtain partial or full credit. The hypothesized Q-matrix is either the result of experts 
rating items of an existing assessment (retrofitting) or comes directly out of the design of the 
assessment instrument, in which it served as a tool to design the items. 

The MGDM takes the GDM and integrates it into the framework of discrete mixture 
distribution models for item response data (see von Davier & Rost, 2006). This increases the 
utility of the GDM by allowing the estimation and testing of models for multiple populations. 
The MGDM allows for complex scale linkages that make assessments comparable across 
populations and makes it possible to test whether items function the same in different 
subpopulations. This can be done with known subpopulations (defined by grade levels, 
cohorts, etc.), as well as with unknown subpopulations that need to be identified by the 
model. In both cases, MGDMs make it possible to determine whether different sets of
item-by-skill parameters and/or different skill distributions have to be assumed for different
subpopulations. This amounts to a generalized procedure that can be used to test for 
differential item functioning (DIF) on one item or on multiple-response variables using 
multiple-group or mixture models. This procedure enables testing DIF models against models 
that allow additional skills for certain items in order to account for differences between 
subpopulations. 

Key words: Item response theory, diagnostic models, discrete MIRT, mixture distribution 
models, multiple classification latent class analysis, polytomous data 
What Are Diagnostic Models?

Rule space methodology (Tatsuoka, 1983) and latent structure models with multiple 
latent classifications (Goodman, 1974a, 1974b; Haberman, 1979; Haertel, 1989; Maris, 1999)
represent the best-known early attempts at diagnostic modeling. The noisy-input
deterministic-and (NIDA) model (Junker & Sijtsma, 2001; Maris, 1999) is an example of a
recently discussed diagnostic model. Similarly, the deterministic-input noisy-and (DINA) 
model, which is a constrained (multiple classification) latent class model, has been discussed
by several authors (Haertel, 1989; Junker & Sijtsma, 2001; Macready & Dayton, 1977). More
recently, the unified model (DiBello, Stout, & Roussos, 1995), which lacks identifiability in 
its original parameterization, underwent modification and was recast as the reparameterized 
unified model (RUM; also referred to as the fusion model or the Arpeggio system; Hartz,
Roussos, & Stout, 2002). 

This paper introduces a class of models for cognitive diagnosis, the general diagnostic 
model (GDM; von Davier, 2005a), in its form for multiple populations and discrete mixtures. 
The GDM is based on developments that integrate (located) latent class models (Formann, 
1985; Lazarsfeld & Henry, 1968); multiple classification latent class models (Maris, 1999); 
and discrete, multidimensional item response theory (MIRT) models (Reckase, 1985) into 
one common framework. von Davier showed that the GDM contains several previous
approaches in addition to some common IRT models as special cases. Similar to previous 
approaches to diagnostic modeling, GDMs describe the probability of a response vector as a 
function of parameters that describe the individual item response variables in terms of
required skills and of indirectly observed (latent) skill profiles of respondents. The
item-by-skill requirements are recorded in most diagnostic models as a design matrix that is often
referred to as a Q-matrix. This Q-matrix consists of rows representing a hypothesis of the 
combination of skills needed to succeed or to obtain partial or full credit in response to a 
particular item. The hypothesized Q-matrix is either the result of experts rating items of an 
existing assessment (retrofitting) or comes directly out of the design of the assessment 
instrument, in which it served as a tool to design the items. In this paper, models referred to 
as the GDM (von Davier, 2005a; von Davier & Yamamoto, 2004a, 2007) will be introduced 
and extended to mixtures and multiple groups, which this paper refers to under the title of the 
mixture general diagnostic model (MGDM). GDMs have been developed to integrate 
multiple-classification, latent class models (Maris, 1999) and located, latent class models 
(Formann, 1985) and may be described as discrete MIRTs (Lord & Novick, 1968; Reckase, 
1985). The MGDM extends the GDM to mixtures of discrete MIRT models (von Davier & 
Rost, 2006; von Davier & Yamamoto, 2007). Unidimensional mixture IRT models were
described by Mislevy and Verhelst (1990), Kelderman and Macready (1990), and Rost 
(1990). von Davier and Rost (1995) extended conditional maximum likelihood methods to 
mixture Rasch models for polytomous data, and von Davier and Yamamoto (2004b) 
described the mixture distribution generalized partial credit model (mixture GPCM), an 
extension of the GPCM (Muraki, 1992). 

The extension of the GDM to multiple groups and/or mixtures of populations 
increases the utility of GDMs by making it possible to estimate and test models in such 
settings. For example, the MGDM allows for complex scale linkages to compare assessments 
across populations (compare von Davier & von Davier, 2004), and it enables testing whether 
items are functioning the same in different populations. This can be done with either known 
populations (grades, cohorts, etc.) or with unknown subpopulations that need to be identified 
by the model. In both cases, MGDMs make it possible to test whether different sets of item- 
by-skill parameters and/or different skill distributions have to be assumed for different 
subpopulations. This amounts to a generalized procedure that can be used to test for DIF on 
one item or on multiple response variables using multiple-group or mixture models and to test 
such DIF models against models that identify additional skills for certain items in order to 
account for differences between subpopulations. 

Diagnostic models typically assume a multivariate, but discrete, latent variable that 
represents the absence or presence, or more gradual levels, of multiple skills. These skill 
profiles have to be inferred through model assumptions with respect to how the observed data
relate to the unobserved skill profile. The absence or presence of skills is commonly 
represented by a Bernoulli (0/1) random variable in the model. Given that the number of 
skills represented in the model is larger than in unidimensional models (obviously greater 
than 2, but smaller than 14 skills in most cases), the latent distribution of skill profiles needs 
some specification of the relationship between skills in order to avoid the estimation of up to 
$2^{14} - 1$ = 16,383 separate skill-pattern probabilities. The GDM (von Davier, 2005a) allows
ordinal skill levels and different forms of skill dependencies to be specified so that more
gradual differences between examinees can be modeled in this framework. 
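
The count quoted above can be checked directly. The following minimal sketch (the function name is hypothetical and not part of any GDM software) computes how many skill-pattern probabilities would have to be estimated freely if no structure were imposed on the skill distribution:

```python
from math import prod

def n_free_pattern_probs(levels_per_skill):
    """Number of freely estimated skill-pattern probabilities.

    There is one probability per skill pattern; one is lost because
    the probabilities must sum to 1.
    """
    return prod(levels_per_skill) - 1

print(n_free_pattern_probs([2] * 14))  # 14 dichotomous skills -> 16383
print(n_free_pattern_probs([2, 5]))    # one binary, one five-level skill -> 9
```

The same function covers ordinal skills: the pattern count is the product of the numbers of levels across skills.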

The following section will introduce the GDM for dichotomous and partial credit data 
and binary as well as ordinal latent skill profiles. Then the MGDM will be introduced. Third, 
scale linkage across multiple groups using GDMs will be discussed. Finally, examples of 
applications of the MGDM in large-scale data analysis will be presented. 





The General Diagnostic Model Framework 

von Davier and Yamamoto (2004a) developed a GDM framework that uses ideas 
from MIRT and multiple-classification and located latent class models. The GDM is suitable
for polytomous items, for dichotomous items, and for mixed items in one or more test forms.
It enables the modeling of polytomous skills, mastery/nonmastery skills, and
pseudo-continuous skills. von Davier (2005b) described the partial credit GDM and developed an
expectation-maximization (EM) algorithm to estimate GDMs. In 2006, this algorithm was 
extended to the estimation of MGDMs. 


General Diagnostic Models for Ordinal Skill Levels 

This section introduces the GDM (developed by von Davier & Yamamoto, 2004a) for 
dichotomous and polytomous data and ordinal skill levels. Diagnostic models can be defined 
by a discrete, multidimensional, latent variable $\theta$; in the case of the MGDM, the
multidimensional skill profile $a = (a_1, \ldots, a_K)$ consists of discrete, user-defined skill levels

$a_k \in \{s_{k1}, \ldots, s_{kl}, \ldots, s_{kL_k}\}$.

In the simplest (and most common) case, the skills are dichotomous (i.e., the skills
will take on only two values, $a_k \in \{0, 1\}$). In this case, the skill levels are interpreted as
mastery (1) versus nonmastery (0) of skill $k$. Let $a = (a_1, \ldots, a_K)$ be a $K$-dimensional skill
profile consisting of $K$ polytomous skill levels $a_k$, $k = 1, \ldots, K$.

The probability of a response x in the general diagnostic model is given by 


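$$P(X_i = x \mid \beta_i, q_i, \gamma_i, a) = \frac{\exp\left[\beta_{xi} + \sum_{k=1}^{K} \gamma_{xik}\, h_i(q_{ik}, a_k)\right]}{1 + \sum_{y=1}^{m_i} \exp\left[\beta_{yi} + \sum_{k=1}^{K} \gamma_{yik}\, h_i(q_{ik}, a_k)\right]} \qquad (1)$$

with $K$-dimensional skill profile $a = (a_1, \ldots, a_K)$ and with some necessary restrictions on the
$\gamma_{xik}$ and the $\beta_{xi}$ to identify the model.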

The Q-matrix entries $q_{ik}$ relate item $i$ to skill $k$ and determine whether or not (and to
what extent) skill $k$ is required for item $i$. If skill $k$ is required for item $i$, then $q_{ik} > 0$. If
skill $k$ is not required, then $q_{ik} = 0$.

The real functions $h_i(q_{ik}, a_k)$ are a central building block of the GDM. The function
$h_i$ maps the skill levels $a_k$ and Q-matrix entries $q_{ik}$ to the real numbers. In most cases, the
same mapping will be adopted for all items, so one can drop the index $i$. The $h$ mapping
defines how the Q-matrix entries and the skill levels interact (von Davier, 2005a; von Davier, 
DiBello, & Yamamoto, 2006). 

Examples of Skill Level Definitions 

Assume that the number of skill levels is $S_k = 2$, and choose skill levels
$a_k \in \{-1.0, +1.0\}$ or, alternatively, $a_k \in \{-0.5, +0.5\}$. Note that these skill levels are a priori
defined constants and not model parameters. This setting can be easily generalized to
polytomous, ordinal skill levels with the number of levels being $S_k = m + 1$ and a
determination of levels such as $a_k \in \{(0 - c), (1 - c), \ldots, (m - c)\}$ for some constant $c$. An
obvious choice is $c = m/2$.

Consider a case with just one dimension, say $K = 1$, and many levels, say $S_k = 41$,
with levels of $a_k$ being equally spaced (a common, but not a necessary, choice), say
$a_k \in \{-4.0, \ldots, +4.0\}$. Here, the GDM mimics a unidimensional IRT model, namely the GPCM
(Muraki, 1992). As a consequence, this IRT-like version of the GDM requires constraints to
remove the indeterminacy of the scale, just as IRT models do.
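
The skill-level definitions in this section are simple to generate programmatically. The sketch below is illustrative only: the function name is hypothetical, and the default centering constant c = m/2 is an assumption of the sketch that places the levels symmetrically around zero.

```python
import numpy as np

def centered_levels(m, c=None):
    """Ordinal skill levels (0 - c), (1 - c), ..., (m - c).

    c defaults to m / 2 (an assumption of this sketch), which centers
    the m + 1 levels at zero.
    """
    if c is None:
        c = m / 2
    return [x - c for x in range(m + 1)]

binary = centered_levels(1)      # two levels: -0.5 and +0.5
five_level = centered_levels(4)  # five levels: -2, -1, 0, 1, 2

# 41 equally spaced levels from -4.0 to +4.0 for the one-dimensional,
# IRT-like case described above.
grid = np.linspace(-4.0, 4.0, 41)
```

Because the levels are fixed constants rather than parameters, a grid like this can be built once, before estimation starts.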

For GDMs with just a few levels per skill, such constraints may not be needed. In the
(most) common case of two levels per skill, the range of skill levels is counterbalanced by the
average of slope parameters. For example, a GDM with $a_k \in \{-1.0, +1.0\}$ produces slope
parameters that are half as big as a GDM that uses $a_k \in \{-0.5, +0.5\}$ as skill levels. This case
does not require constraints, as just one proportion determines the mean and variance of a
binary variable.
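
The trade-off between the range of the skill levels and the size of the slopes can be verified numerically. The sketch below assumes a dichotomous item that requires a single skill, with illustrative parameter values; it is not taken from any GDM software.

```python
import math

def p_correct(beta, gamma, a):
    """Dichotomous GDM with one required skill: P(X = 1 | a)."""
    z = beta + gamma * a
    return 1.0 / (1.0 + math.exp(-z))

beta = -0.3  # illustrative difficulty parameter

# Coding A: skill levels {-1, +1} with slope 0.7.
coding_a = [p_correct(beta, 0.7, a) for a in (-1.0, +1.0)]
# Coding B: skill levels {-0.5, +0.5} with a slope twice as large.
coding_b = [p_correct(beta, 1.4, a) for a in (-0.5, +0.5)]
```

Both codings yield the same response probabilities for nonmasters and masters, so the choice of coding changes the slope estimates but not the fitted model.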

von Davier and Yamamoto (2004a) showed that the GDM already contains a
compensatory version of the fusion model as well as many common IRT models as special
cases. The parameters $\beta_{xi}$ may be viewed as item difficulties in the dichotomous case and as
threshold parameters in the polytomous case, and the $\gamma_{xik}$ may be interpreted as slope
parameters.

General Diagnostic Models for Partial Credit Data 

For a partial credit version of the GDM, choose $h(q_{ik}, a_k) = q_{ik} a_k$ with a binary (0/1)
Q-matrix. The resulting model contains many standard IRT models and their extensions to
confirmatory MIRT models using Q-matrices. This GDM may be viewed as a multivariate,
discrete version of the GPCM. For a response $x \in \{0, 1, 2, \ldots, m_i\}$, the model-based probability
in this GDM is


$$P(X_i = x \mid \beta_i, a, q_i, \gamma_i) = \frac{\exp\left[\beta_{xi} + \sum_{k=1}^{K} x\, \gamma_{ik}\, q_{ik}\, a_k\right]}{1 + \sum_{y=1}^{m_i} \exp\left[\beta_{yi} + \sum_{k=1}^{K} y\, \gamma_{ik}\, q_{ik}\, a_k\right]} \qquad (2)$$

with $K$ attributes (discrete latent traits) $a = (a_1, \ldots, a_K)$ and a dichotomous design Q-matrix
$(q_{ik})_{i=1..I,\, k=1..K}$. These $a_k$ are discrete scores determined before estimation and can be chosen
by the user. These scores are used to assign real numbers to the skill levels; for example,
$a(0) = -1.0$ and $a(1) = +1.0$ may be chosen for dichotomous skills.
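
As an illustration of Equation 2, the following sketch computes the category probabilities of a single item for a given skill profile. The function name and all parameter values are hypothetical; the difficulty of category 0 is taken to be fixed at 0, which corresponds to the leading 1 in the denominator.

```python
import numpy as np

def gdm_category_probs(beta, gamma, q, a):
    """Category probabilities for one item under the partial credit GDM.

    beta  : difficulties for categories 1..m (category 0 is fixed at 0)
    gamma : slopes, one per skill
    q     : 0/1 Q-matrix row of the item
    a     : skill profile, coded with user-chosen real-valued levels
    """
    beta = np.asarray(beta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    q = np.asarray(q, dtype=float)
    a = np.asarray(a, dtype=float)
    # Weighted skill sum; skills with q = 0 drop out.
    eta = float(np.dot(gamma * q, a))
    # Exponents for categories 0, 1, ..., m; category x scales eta by x.
    z = np.concatenate(([0.0], beta + np.arange(1, beta.size + 1) * eta))
    p = np.exp(z - z.max())  # subtract the maximum for numerical stability
    return p / p.sum()

# Four-category item requiring both of two skills (illustrative values).
probs = gdm_category_probs(beta=[0.5, -0.2, -1.0], gamma=[1.0, 0.8],
                           q=[1, 1], a=[1.0, 2.0])
```

Setting an entry of `q` to 0 removes the corresponding skill from the item, regardless of the slope value stored for it.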

For a vector of item responses, local stochastic independence (LI) is assumed, which
yields

$$P(X = x \mid \beta, a, Q, \gamma) = \prod_{i=1}^{I} P(X_i = x_i \mid \beta_i, a, q_i, \gamma_i)$$

for a vector of item responses $x = (x_1, \ldots, x_I)$, a Q-matrix $Q$, and a skill profile $a = (a_1, \ldots, a_K)$,
as well as matrix-valued item difficulties $\beta$ and slopes $\gamma$. The marginal probability of a
response vector is given by

$$P(X = x \mid \beta, Q, \gamma) = \sum_{a} \pi_a \prod_{i=1}^{I} P(X_i = x_i \mid \beta_i, a, q_i, \gamma_i),$$

which is the sum over all skill patterns $a = (a_1, \ldots, a_K)$, assuming that the discrete count
density of the skill distribution is $\pi_a = P(A = a)$.
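
The marginal probability can be sketched by brute-force enumeration of the skill patterns, which is feasible only for small skill spaces. All names and numerical values below are hypothetical, and the item model is the partial credit GDM with the category-0 difficulty fixed at 0.

```python
import itertools
import numpy as np

def category_probs(beta, gamma, q, a):
    """Partial credit GDM probabilities for categories 0..m of one item."""
    beta = np.asarray(beta, dtype=float)
    eta = float(np.dot(np.asarray(gamma, dtype=float) * np.asarray(q, dtype=float), a))
    z = np.concatenate(([0.0], beta + np.arange(1, beta.size + 1) * eta))
    p = np.exp(z - z.max())
    return p / p.sum()

def marginal_prob(x, betas, gammas, Q, patterns, pi):
    """Sum over skill patterns of pi_a times the product of item probabilities."""
    total = 0.0
    for a, weight in zip(patterns, pi):
        lik = 1.0
        for i, x_i in enumerate(x):  # local independence: multiply over items
            lik *= category_probs(betas[i], gammas[i], Q[i], a)[x_i]
        total += weight * lik
    return total

# Hypothetical setup: two dichotomous items, two binary skills coded -1/+1.
Q = [[1, 0], [1, 1]]
betas = [[0.2], [-0.4]]
gammas = [[1.0, 0.0], [0.6, 0.9]]
patterns = list(itertools.product([-1.0, 1.0], repeat=2))
pi = [0.4, 0.1, 0.2, 0.3]  # skill-pattern probabilities, summing to 1

p_11 = marginal_prob((1, 1), betas, gammas, Q, patterns, pi)
```

A useful sanity check is that the marginal probabilities of all possible response vectors sum to 1.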

De la Torre and Douglas (2004) estimated the dichotomous version of this model, the
linear logistic model (LLM; Hagenaars, 1993; Maris, 1999), using Markov chain Monte
Carlo (MCMC) methods. For ordinal skills with $s_k$ levels, the $a_k$ may be defined using
$a(x) = x$ for $x = 0, \ldots, (s_k - 1)$ or $a(0) = -s_k/2, \ldots, a(s_k - 1) = s_k/2$. The parameters of the
models as given in Equation 2 can be estimated for dichotomous and polytomous data, as
well as for ordinal skills, using the EM algorithm.
An Example of a Simple Diagnostic Model

What do diagnostic models look like? In the following example, there are two
hypothesized skills: a dichotomous mastery/nonmastery skill, T1, with levels in $\{-1, 1\}$, and an
ordinal skill, T2, with levels in $\{-2, -1, 0, 1, 2\}$, that is, with five proficiency levels.

In addition, there are seven observed variables, referred to as the item response
variables in psychometric models and models for educational measurement. In this example,
we assume that a mixed-format set of three dichotomous items, X1, ..., X3, each with responses
in $\{0, 1\}$, and four polytomous items, X4, ..., X7, each with responses in $\{0, 1, 2, 3\}$, is observed.

The Q-matrix, which relates items to the underlying skill variables, has two columns, 
one each for the two skills T1 and T2, and seven rows. The Q-matrix for this example may 
look like 


$$Q = \begin{pmatrix}
1 & 0 \\
1 & 1 \\
1 & 0 \\
0 & 1 \\
1 & 1 \\
0 & 1 \\
0 & 1
\end{pmatrix} \qquad (3)$$

which indicates that skill T1 is required for items X1, X2, X3, and X5, but not for the
remaining items. Skill T2 is required for items X2, X4, X5, X6, and X7, but not for items X1
and X3. An illustration of Equation 3 is shown in Figure 1.
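
The example Q-matrix can be written down as an array, with rows for items X1 to X7 and columns for skills T1 and T2, following the item-by-skill requirements stated above. The variable names are illustrative.

```python
import numpy as np

# Rows: items X1..X7; columns: skills T1, T2.
Q = np.array([
    [1, 0],  # X1
    [1, 1],  # X2
    [1, 0],  # X3
    [0, 1],  # X4
    [1, 1],  # X5
    [0, 1],  # X6
    [0, 1],  # X7
])
items = [f"X{i}" for i in range(1, 8)]

# List the items that load on each skill.
requires_t1 = [name for name, row in zip(items, Q) if row[0] == 1]
requires_t2 = [name for name, row in zip(items, Q) if row[1] == 1]
```

Items that appear in both lists (X2 and X5) require both skills.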



Figure 1. A graph of the example diagnostic model. 
Models such as the one depicted in Figure 1 implicitly assume that the same structure
holds for all examinees in the population from which the observations are sampled. More 
specifically, when using the Q-matrix given in Equation 3 and the model equation given in 
Equation 2, one assumes that this structure holds for all examinees in the population. 

Mixtures of General Diagnostic Models 

The GDM can also be extended to a mixture-distribution IRT model (von Davier
& Rost, 2006; see also the corresponding sections below). This allows for the estimation of 
this class of diagnostic model in different latent classes without prespecifying which 
observation belongs to which class and provides the ability to check whether the same kind of 
skill-by-item relationships holds for all the subjects sampled from a particular population. A 
multiple-group version of the GDM can also be specified and estimated using the algorithm 
described below. This allows the estimation of diagnostic models that contain partially 
missing grouping information (similar to the approach described in von Davier & Yamamoto, 
2004b). For diagnostic models involving multiple observed groups or multiple unobserved 
populations (latent classes), parameter constraints can be specified that ensure scale linkages 
across these populations. The MGDM is 


$$P(X_i = x \mid \beta_i, a, q_i, \gamma_i, g) = \frac{\exp\left[\beta_{xig} + \sum_{k=1}^{K} x\, \gamma_{ikg}\, q_{ik}\, a_k\right]}{1 + \sum_{y=1}^{m_i} \exp\left[\beta_{yig} + \sum_{k=1}^{K} y\, \gamma_{ikg}\, q_{ik}\, a_k\right]}$$

with parameters as defined above and added group index $g$. This model allows the
estimation of separate model parameters in the $G$ separate groups. The groups may be defined
by an observed-group indicator variable; in this case, the above model is the diagnostic model
equivalent of a multiple-group IRT model (Bock & Zimowski, 1997). If the groups are
unobserved and have to be inferred during estimation, the above model is a discrete mixture
diagnostic model (see von Davier & Rost, 2006; von Davier & Yamamoto, 2007).

The marginal probability of a response vector in the MGDM is

$$P(X = x \mid \beta, Q, \gamma) = \sum_{g} \pi_g \sum_{a} \pi_{a|g} \prod_{i=1}^{I} P(X_i = x_i \mid \beta_i, a, q_i, \gamma_i, g)$$

with cube-valued (classes $g$ times items $i$ times categories $x$) item difficulties,

$$\beta = (\beta_{xig})_{g=1..G;\, i=1..I;\, x=1..m_i},$$

and cube-valued (classes $g$ times items $i$ times skills $k$) slope parameters,

$$\gamma = (\gamma_{ikg})_{g=1..G;\, i=1..I;\, k=1..K},$$

and a (0/1) Q-matrix $Q$. Let $\pi_g$ denote the relative class or group sizes and $\pi_{a|g} = P(a \mid g)$
denote the class- or group-specific distribution of skill patterns $a$ in group $g$.
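
The mixing step of the MGDM can be sketched as follows, assuming that the conditional likelihoods of one response vector have already been computed for every group and skill pattern. The function name and the numerical values are hypothetical. The posterior group probabilities returned here are the quantities an EM-type algorithm would use when group membership is unobserved; with observed groups, only the sum over skill patterns within the known group is needed.

```python
import numpy as np

def mixture_marginal(cond_lik, pi_a_given_g, pi_g):
    """Marginal probability and posterior group membership for one response vector.

    cond_lik[g, a]     : product over items of P(X_i = x_i | a, g)
    pi_a_given_g[g, a] : skill-pattern distribution within group g
    pi_g[g]            : relative group (class) sizes
    """
    cond_lik = np.asarray(cond_lik, dtype=float)
    pi_a_given_g = np.asarray(pi_a_given_g, dtype=float)
    pi_g = np.asarray(pi_g, dtype=float)
    per_group = (pi_a_given_g * cond_lik).sum(axis=1)  # sum over skill patterns
    joint = pi_g * per_group                           # weight by group size
    marginal = joint.sum()
    return marginal, joint / marginal

# Hypothetical values: two groups, two skill patterns.
cond_lik = [[0.10, 0.40], [0.30, 0.20]]
pi_a_given_g = [[0.5, 0.5], [0.2, 0.8]]
pi_g = [0.6, 0.4]
marginal, posterior = mixture_marginal(cond_lik, pi_a_given_g, pi_g)
```

In this toy example the marginal probability is 0.6 x 0.25 + 0.4 x 0.22 = 0.238.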

Figure 2 illustrates a diagnostic model in multiple populations. This figure indicates 
that the item parameters and skill distributions are modelled separately in the different 
instances of the grouping variable g by providing a separate graph for each group. 

Instead of separate graphs for separable populations, groups, or classes, the 
dependency of the model parameters on the group indicator can be illustrated by adding some 
arrows to the diagnostic model graph in Figure 1. Figure 3 presents the multiple-group 
variant in this manner. In this figure, the circles in shades of gray targeted by the
Group/Class population indicator variable represent the dependency of item parameters on
the population.

In Figure 3, all arrows originating from latent variables T1 and T2 include a gray
circle that indicates the population dependency of the item parameters. Originating from a
new variable (Group/Class) are the arrows that target these 'population dependency'
indicators. In addition, the distributions of latent variables T1 and T2 are on the receiving end
of arrows originating from the group indicator variable (Group/Class), indicating that the
latent trait distributions for T1 and T2 may also vary across populations.

Multiple-group and mixture models for item response data may be used to study how 
different two or more populations are by looking at how parameter sets from the model differ 
for different groups or subpopulations. These models are useful to separate samples into 
groups employing different strategies to solve items (Kelderman & Macready, 1990; Rost & 
von Davier, 1993). Researchers have used these models to identify response styles and faking 
(Eid & Zickar, 2007; Rost, Carstensen, & von Davier, 1996, 1999). Rijmen and DeBoeck 
(2003) studied the relationship of mixture IRT models to MIRT (Reckase, 1985). More 
generally, mixture models can be used to test whether a unidimensional IRT model is 
appropriate for the data at hand (Rost & von Davier, 1995). 






Figure 2. A graph of a multiple-population or mixture diagnostic model. 






















Multigroup Model with Group Specific Item Parameters 

Figure 3. An alternative graph of a multiple-group or mixture diagnostic model. 

Scale Linkages Across Mixture/Multiple-Group Diagnostic Models 

In the unconstrained case, all item parameters may differ across the subpopulations in 
an MGDM. This is not always desirable because comparisons across groups require some 
common interpretation of the parameters involved. This is usually interpreted as meaning the 
parameters have to be on the same IRT scale. Scale linkage in IRT models enables the 
comparison of ability estimates across different populations (see, e.g., von Davier & von 
Davier, 2004; Kolen & Brennan, 1995). The scale indeterminacy of IRT models makes these 
models invariant under appropriate linear transformations of the parameters involved so that 
parameter estimates of common items can be transformed (or constrained) to match certain 
objectives. The objective to be met is either matching moments of the item parameters shared 
across forms or populations or setting equal the common items’ parameters across forms or 
populations (von Davier & von Davier). This objective can be accomplished by employing 
constrained maximum likelihood estimation or by maximizing a modified likelihood that 
adds a penalty term or a Lagrange multiplier (Aitchison & Silvey, 1958). 
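
The invariance under linear transformations and the moment-matching objective can be illustrated with a unidimensional two-parameter logistic item as a stand-in. This is a generic sketch with hypothetical function names and values, not the procedure implemented in any particular software.

```python
import math
import statistics

def p_2pl(theta, a, b):
    """Two-parameter logistic response probability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def linear_transform(theta, a, b, A, B):
    """Rescale ability and item parameters; response probabilities are unchanged."""
    return A * theta + B, a / A, A * b + B

def moment_matching_constants(b_source, b_target):
    """Constants that match the mean and standard deviation of the
    common items' difficulties on the source scale to the target scale."""
    A = statistics.pstdev(b_target) / statistics.pstdev(b_source)
    B = statistics.mean(b_target) - A * statistics.mean(b_source)
    return A, B

# Hypothetical common-item difficulties on two scales.
b_source = [-1.0, 0.0, 1.0]
b_target = [-1.5, 0.5, 2.5]
A, B = moment_matching_constants(b_source, b_target)

# Transform one examinee and one item from the source to the target scale.
theta_new, a_new, b_new = linear_transform(0.3, 1.2, b_source[0], A, B)
```

The alternative objective named above, setting the common items' parameters equal across populations, is imposed during estimation and therefore needs no transformation afterward.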

In MGDMs, comparisons across subpopulations are made possible in the same way 
items are constrained in IRT scale linkages. The most stringent comparisons are made 
possible by assuming that the same item parameters hold for a set of common items across 
subpopulations. In a graph, arrows originating from the group indicator mean that the 
targeted parameter depends on the group indicator, while the absence of arrows means that the
targeted parameter is independent of the grouping indicator $g$ (i.e., the absence of arrows
indicates that item parameters are constrained to be equal across subpopulations).

Figure 4 illustrates this sort of linkage in MGDMs. The items without a direct arrow
originating from the ellipse labeled group are items X2, X4, and X5; these items have the
same parameters across subpopulations. Items X1, X3, X6, and X7 have group-dependent
parameters, which are indicated by arrows originating from the group ellipse. The same holds
for the skill variables T1 and T2 as well as their covariance. The distribution of these
variables and their relationship may vary across subpopulations, which is indicated by an
arrow originating from the group ellipse.



General Diagnostic Model with Group Specific and Unspecific Item Parameters 

Figure 4. A mixture/multiple-group general diagnostic model with equality constraints. 

The mdltm software (von Davier, 2005b) allows for the definition of equality 
constraints across pairs of items or multiple items in different subpopulations, as well as 
constraints that affect only difficulties or slopes in MGDMs. In addition, parameter 
constraints can be employed that fix item parameters to certain values, for example, to 
parameter values from previous calibrations. Other scale linkages such as the mean-variance 
methods used in unidimensional IRT (Loyd & Hoover, 1980; Marco, 1977) are also available 
for estimating linked GDMs in several populations by invoking corresponding key words in
the mdltm scripting language. 
In addition to scale-linkage methods that mirror traditional methods used in IRT,
parameter constraints for MGDMs can be used to develop new methods of scale linkage and 
even new models within the class of MGDMs as outlined next. 

Different Q-Matrices in Different Populations 

The same skill-by-item structure implied by the Q-matrix might not be appropriate for 
all subpopulations. Imagine that different student groups receive test preparation from 
different vendors, so that some students are trained to use additional methods to make sure 
their responses are correct. In this case, different Q-matrices might hold in different 
subpopulations since some student groups are trained to use this additional method, and may 
do so with varying success. The other students most likely do not know about this method,
since they have not been trained in using it. 

In the framework of MGDMs, this can be implemented as follows using the methods of
parameter constraints offered by mdltm: Define a super Q-matrix with entries of 1 if a skill is 
needed for an item in at least one subpopulation and set the Q-matrix to 0 only if the skill is 
not required for an item in all subpopulations. Then fix slope parameters to equal 0 for skills 
that are not needed in certain subpopulations for certain items. This ensures that no slope is 
estimated in these subpopulations as the slope has been set to equal 0. In these 
subpopulations, the corresponding skill (with the slope equalling 0 for certain items) does not 
contribute to items constrained in that way. 
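
The construction of the super Q-matrix and of the accompanying slope constraints can be sketched as follows. This illustrates the logic only; it does not use mdltm syntax, and the function names and example matrices are hypothetical.

```python
import numpy as np

def super_q_matrix(q_by_group):
    """Entry is 1 if the skill is needed for the item in at least one subpopulation."""
    return np.stack([np.asarray(q) for q in q_by_group]).max(axis=0)

def zero_slope_masks(q_by_group):
    """Per group, True where a slope allowed by the super Q-matrix must be fixed at 0."""
    q_super = super_q_matrix(q_by_group)
    return [(q_super == 1) & (np.asarray(q_g) == 0) for q_g in q_by_group]

# Hypothetical example: group 2 was trained to use an additional skill
# (second column) on the first item; group 1 was not.
q_group1 = np.array([[1, 0], [1, 0], [0, 1]])
q_group2 = np.array([[1, 1], [1, 0], [0, 1]])

q_super = super_q_matrix([q_group1, q_group2])
masks = zero_slope_masks([q_group1, q_group2])
```

Here `masks[0]` flags the first item's slope on the second skill as fixed at 0 in group 1, while no slope is constrained in group 2.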

In the next step, the fit of these constrained models with unique Q-matrices across 
subpopulations may be compared to models that do not impose such constraints. This will 
provide evidence on how appropriate the assumptions are that lead to a specific constrained
model. 

Strongest Form of Linkage Across Multiple Populations 

Another important case of a constrained model for multiple populations is a multiple- 
group model where all (common) items are assumed to have the same parameters in all 
subpopulations. This means that while each common item may have a parameter that differs 
from other items in the same population, the common item is assumed to have the same 
parameters across populations (i.e., different administrations, different cohorts). Only the 
ability distributions differ across subpopulations. For instance, the ability distribution in the
example is $P(T1, T2 \mid g)$, where $g$ stands for the group or population under consideration.
This form of the model measures identical skills, allowing for different skill distributions
across subpopulations.

The rationale for a multiple-group model that includes just one set of item parameters 
is the assumption of measurement invariance, in the sense that the item’s conditional 
response probability depends on a unidimensional, person-specific variable only. Given the 
value of this variable (e.g., the skill, ability, or proficiency of an examinee) and knowing the 
item characteristics or parameters, the response probability is determined regardless of the
group to which the examinee belongs. Figure 5 illustrates this form of equality constraint across
subpopulations; note that the arrows originating from the group ellipse target the skill 
distribution variables only, and no arrow targets the gray bubbles, which represent the item 
characteristics. 



Multigroup Model with Group Unspecific Item Parameters 


Figure 5. Strongest form of linkage across multiple populations. 

This measurement-invariant MGDM assumes that item characteristics are exactly the 
same across groups, while allowing skill distribution differences across groups. Other models 
are easily obtained by varying the types of constraints presented in this paper. 

Applications 

GDMs have been applied in research studies conducted for different large-scale
assessment programs. This section gives a brief summary of three applications to data from 
such programs. 
General Diagnostic Models and English Language Testing

von Davier (2005a) analyzed TOEFL iBT pilot data with common IRT models and
GDMs using expert-generated Q-matrices. It was hypothesized that four skills each could be 
identified in the listening and in the reading sections of the TOEFL iBT pilot data. In contrast 
to this expectation, it was found that a unidimensional mixed two-parameter logistic 
(2PL)/GPCM IRT model fits the data as well as the GDM with four dichotomous 
mastery/nonmastery skills. 

For reasons of parsimony, von Davier (2005a) concluded that the 2PL/GPCM was to 
be favored for both the listening and the reading section. Figure 6 shows the relation between 
the 2PL ability estimate and the skill-mastery probabilities for the listening section data of 
this study. Figure 6 shows very similar results for the two forms of the TOEFL iBT, Form A 
and Form B, used in von Davier’s study. It is evident that all four skills have a rather strong 
relationship to the overall 2PL parameter estimate. The probability of skill mastery increases 
in a very systematic fashion with increasing 2PL parameter. The width of the four S-shaped 
plots is mainly a function of reliability of the skill-mastery probability. If the skill is 
measured by many items, the S-shaped curve is narrower; if few items are used to measure 
the skill, the S-shape is a little wider. 



[Figure 6 panels: Skill 1 Form B, Skill 2 Form B, Skill 3 Form B, Skill 4 Form B; horizontal axis: 2PL theta estimate.]


Figure 6. English language general diagnostic model, listening Forms A and B. 


In additional analyses, mdltm was used to test a unidimensional IRT model, a two- 
dimensional IRT model employing the 2PL/GPCM, and a GDM. Each model contained all 
eight skills (four for reading, four for listening) in one Q-matrix and was composed of the 
joint listening and reading parts of the TOEFL iBT pilot data. It was found that the
two-dimensional discrete IRT model estimated in the GDM framework provides the best data
description in terms of balancing parsimony and model-data fit. 

MGDMs for Matrix Samples of Item Responses 

Xu and von Davier (2006) used a multiple-group GDM with large-scale survey data. 
In their example, gender and race/ethnicity were used as grouping variables. Xu and von 
Davier used data from the 2002 12th-grade National Assessment of Educational Progress 
(NAEP; for the history of the national assessment, see Jones & Olkin, 2004) in reading and 
mathematics. The reading data were modeled using MGDMs with up to three dimensions, 
and the mathematics data were modeled using MGDMs with up to seven dimensions (four 
content domains plus three complexity levels). Because data from large-scale surveys are
extremely sparse in nature, the authors performed a parameter-recovery study based on
estimating GDMs in sparse samples of item responses. 

The results are reported in detail in Xu and von Davier (2006). The parameter- 
recovery results under different levels of sparseness of data support the feasibility of 
estimating GDMs under such conditions. Table 1 presents results of this study, making use of 
the average bias and the root mean square error obtained under different degrees of data 
sparseness. 

Table 1

Bias and RMSE of GDM Item Difficulties and Slope Parameter and Skill Distribution
Probability Estimates

                                    Percentage of missing data
Measure                             10%       25%       50%
Item parameters
    Average bias                    0.001     0.002     0.005
    Average RMSE                    0.071     0.083     0.119
Skill distribution
    Average bias                    0.000     0.000     0.000
    Average RMSE                    0.004     0.004     0.007
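
The two accuracy measures in Table 1 can be computed from replicated estimates along the following lines. This is a generic sketch: the function name and toy values are hypothetical, and the averaging scheme (per-parameter bias and RMSE over replications, then a mean over parameters) is an assumption rather than the documented procedure of the cited study.

```python
import numpy as np

def average_bias_and_rmse(estimates, truth):
    """Average bias and average RMSE across parameters.

    estimates : array of shape (replications, parameters)
    truth     : array of shape (parameters,) with the generating values
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    errors = estimates - truth                   # truth is broadcast over replications
    bias = errors.mean(axis=0)                   # per-parameter bias
    rmse = np.sqrt((errors ** 2).mean(axis=0))   # per-parameter RMSE
    return bias.mean(), rmse.mean()

# Toy example: two parameters, two replications.
truth = [1.0, 2.0]
estimates = [[1.1, 1.9], [0.9, 2.1]]
avg_bias, avg_rmse = average_bias_and_rmse(estimates, truth)
```

In the toy example the errors cancel on average, so the bias is essentially zero while the RMSE is about 0.1, which mirrors the pattern in Table 1 of negligible bias alongside a nonzero RMSE.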


The results reported by Xu and von Davier (2006) on the NAEP data showed that a
multidimensional MGDM (both single-group and multiple-group versions of the GDM were
tested) fit the reading data consistently better than a unidimensional IRT model.
However, a unidimensional IRT model fit the math data better than a three-, four-, or even a 
seven-dimensional GDM. This result has since been replicated using other larger NAEP data 
sets and can be explained by the fact that the reading domains correlate less than do the math
subscales when defined by either content or complexity factors. 

Mixture IRT, General Diagnostic, and Latent Class Models 

Huang and von Davier (2006) used background data from an international large-scale 
assessment of adult literacy. Their results were based on data from approximately 47,000 
adults assessed with cognitive adult literacy scales and background questionnaires. The 
sample contained data from seven countries. 

The goal of this study was to develop indicator variables using latent class models, 
GDMs, or common IRT models. The purpose of these derived indicator variables was to 
provide more reliable background variables for secondary data analysis by aggregating 
response data. 

In this study, the three models listed above fit the relatively short scales equally well. Measures
of model-data fit made it evident that a distinction between discrete and continuous latent 
variables was difficult to make if only a few observed variables were used. 

Conclusions and Outlook 

This paper introduces MGDMs and presents evidence for the utility of this class of 
models. So far, examples of successful applications of the GDM and its mixture 
generalizations come from data analyses aimed at identifying the necessary level of 
complexity needed to fit observed responses and exploring multiple-group versus
single-group models as examples of scale linkages across multiple populations.

Obvious next steps include the introduction of covariates for predicting skill 
distributions. One common way to do this is to extend the GDM using a latent regression 
model—a conditioning model in the language of NAEP and other large-scale survey 
assessments (von Davier, Sinharay, Oranje, & Beaton, 2006). Figure 7 illustrates this model 
extension with the example used in previous figures. 

Xu and von Davier (2006) developed parametric skill-distribution models for GDMs. 
These parametric families of discrete skill distributions enable the skill space to be modeled 
more parsimoniously, so that models with a larger skill count are still estimable even when 
the sample sizes in the different subpopulations are not large. These extensions have been 
implemented in mdltm and can be estimated with customary maximum likelihood methods. 
These developments are currently being studied using a variety of large-scale data sets. 
Next: Model with Latent Regression using Background Data


Figure 7. Extending the GDM using a latent regression model. 

However, even with computationally efficient methods and fast computers that enable 
complex models to be estimated in a reasonable time, the question of model complexity 
remains important. For this reason, research on model-data fit and on the balance between
model-data fit and the parsimony of models is imperative (see Haberman & von Davier, 2006).

Mixture general diagnostic models are a useful tool for educational measurement 
research. The potential of these models for practical large-scale data analysis lies in the fact 
that models of different complexity can be specified within a common framework, estimated 
using standard maximum likelihood methods, and directly compared in terms of their 
predictive power. 